title: “52414: Home Exam”

date: “June 2rth, 2020”

output: html_document


Contents:



Q0.Submission Instructions (Please read carefully)

The exam will be submitted individually by uploading the solved exam Rmd and html files to the course moodle.

Please name your files as 52414-HomeExam_ID.Rmd and 52414-HomeExam_ID.html where ID is replaced by your ID number (do not write your name in the file name or in the exam itself).

The number of points for each sub-question is indicated next to it, with \(105\) points overall. The total grade will be at most \(100\).

Once you click on the moodle link for the home exam, the exam will start and you have three days (72 hours) to complete and submit it.

The exam will be available from June 2rth to July 19th. The last submission time is June 19th at 23:59.

You may use all course materials, the web and other written materials and R libraries.

You are NOT allowed to discuss any of the exam questions/materials with other students.

Data Analysis:

Some questions may require data wrangling and manipulation which you need to

decide on.

In questions involving real data, some data may be missing (‘NA’), wrong (e.g. negative values for positive variables)

or cause numerical problems (e.g. zeros which go into a log-transform). If you encounter such data, use your best judgment (e.g. excluding them, zeroing negative values, using \(log(1+x)\) etc.) and mention what you have done in your analysis.

Whenever possible, use objective and specific terms and quantities learned in class, and avoid subjective and general unquantified statements. For example:

Good: “We see a \(2.5\)-fold increase in the curve from Jan. 1st to March 1st”.

Bad: “The curve goes up at the beginning”.

Good: “The median is \(4.7\). We detected five outliers with distance \(>3\) standard deviations from the median”.

Bad: “The five points on the sides seem far from the middle”.

Presentation of Results:

Write your answers and explanations in the text of the Rmd file (not in the code).

The text of your answers should be next to the relevant code, plots and tables and refer to them, and not at a separate place at the end.

You need to explain every step of your analysis. When in doubt, a more detailed explanation is better

than omitting explanations.

Give informative titles, axis names and names for each curve/bar in your graphs.

In some graphs you may need to change the graph limits. If you do so, please include the outlier points you have removed in a separate table.

Add informative comments explaining your code

Sometimes Tables are the best way to present your results (e.g. when asked for a list of items). Exclude irrelevant

rows/columns. Display clearly items’ names in your Tables.

Show numbers in plots/tables using standard digits and not scientific display.

That is: 90000000 and not 9e+06.

Round numbers to at most 3 digits after the dot - that is, 9.456 and not 9.45581451044

Required libraries are called in the Rmd file. Install any library missing from your R environment. You are allowed

to add additional libraries if you want.

If you do so, please add them at the start of the Rmd file, right below the existing libraries, and explain what libraries you’ve added, and what is each new library used for.



Q1. Compute Areas Using Monte Carlo Sampling (25 pt)

Monte Carlo method is a useful tool for transforming problems of probabilistic nature into deterministic computations using the law of large numbers.

Assume that the radius \(r\) of the four overlapping white circles in the figure below is the last digit of your ID number (‘sifrat-bikoret’). If this digit is \(0\) set \(r=10\). The radius of the large circle is \(2r\).

  1. (5pt) Describe in a few sentences a Monte-Carlo-based algorithm for calculating the orange area. Aside from verbal explanations, you may use pseudo-code and/or mathematical symbols to explain your algorithm.


Answer
The Monte-Carlo-based algorithm for calculating an area is a method that uses
random sampling in order to estimate a parameter(or in our case an area).
The Monte-Carlo algorithm will work as follows:

  1. first, we will generate two random samples that will represent coordinates in the a [-2r,2r]^2 square

  2. than, according to our data we will check if the coordinates belong to the “orange area” or the four overlapping circles or the square(but not in the circle), and adding the coordinates to the right vector(or make a new random sample if the coordinates are not in the large circle) .

  3. repeat, 1 and 2 “n” times

  4. compute the area of the large circle.

  5. now we will notice that the number of coordinates in the orange area + the number of coordinates in the four overlapping circles = number of samples
    so the number of coordinates in the orange area divided by number of samples is the ratio of the orange area in the circle
    so area*the number of coordinates in the orange area/n = orange area.

  1. (9pt) Perform a Monte Carlo simulation based on your description from (a.), and estimate the orange area using \(n=50,000\) random samples.
# initialzing the start point for monte carlo with 50,000 samples
# and set vectors to save the points that genareted
n <- 50000
r <- 8
sam_vec_in_x<-vector()
sam_vec_in_y <- vector()
sam_vec_out_x <-vector()
sam_vec_out_y <- vector()
counter <-1
# using for loops and conditions to divaid the samples in to points in 
# the four cerciles  and points that are out of the four cercules 
while( counter< n) {
  sam_x <- runif(1,-2*r,2*r)
  sam_y <- runif(1,-2*r,2*r)
if( (sam_x-r)^2 + (sam_y)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
}else if( sam_x^2 + (sam_y-r)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
}else if((sam_x+r)^2 + (sam_y)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
} else if( (sam_x^2) + (sam_y+r)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
}
if(sam_x^2 + sam_y^2 <= 4*r^2 & (sam_x^2) + (sam_y+r)^2 >= r^2 & (sam_x+r)^2 + (sam_y)^2 >= r^2 &  sam_x^2 + (sam_y-r)^2 >= r^2 & (sam_x-r)^2 + (sam_y)^2 >= r^2 ){
  sam_vec_out_x <- c(sam_vec_out_x,sam_x)
  sam_vec_out_y <- c(sam_vec_out_y,sam_y)
  counter <- counter +1
}
   
}
# computing the the area 
area <- pi*(2*r)^2
orange_area <-area*(length(sam_vec_out_x)/50000)
round(orange_area,3)
## [1] 144.877
  1. (5pt) Plot the simulated points you have generated, with the points that fall inside the orange area in one color and the rest of the points in another color.
# creating data frames for the ggplot of monte carlo
data_in <- data_frame(sam_vec_in_x,sam_vec_in_y)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
data_out <- data_frame(sam_vec_out_x,sam_vec_out_y)

carlo <- ggplot(data=data_out,mapping = aes(sam_vec_out_x,sam_vec_out_y)) + geom_point(color="yellow") + coord_cartesian(xlim = c(-2*r, 2*r),ylim = c(-2*r,2*r))
carlo + geom_point(mapping = aes(sam_vec_in_x,sam_vec_in_y),data = data_in) + labs(title = "Monte-carlo simulated points",x="x",y="y")

  1. (6pt) Compute an estimate for the standard deviation of your estimator from (b.), and use the Normal approximation to compute a \(95\%\) confidence interval for the area based on this standard deviation.
# i have used the code fron (b.) in the function in order to have multiple
# estimators of the "orange area" in order to calculate the variance
monte_carlo <-function(n,r){
sam_vec_in_x<-vector()
sam_vec_in_y <- vector()
sam_vec_out_x <-vector()
sam_vec_out_y <- vector()
counter <-1
# using for loops and conditions divaid the samples in to points in 
# the four cerciles  and points that are out of the four cercules 
while( counter< n) {
  sam_x <- runif(1,-2*r,2*r)
  sam_y <- runif(1,-2*r,2*r)
if( (sam_x-r)^2 + (sam_y)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
}else if( sam_x^2 + (sam_y-r)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
}else if((sam_x+r)^2 + (sam_y)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
} else if( (sam_x^2) + (sam_y+r)^2 <= r^2) {
  sam_vec_in_x<-c(sam_vec_in_x,sam_x)
  sam_vec_in_y <-c(sam_vec_in_y,sam_y)
  counter <- counter +1
}
if(sam_x^2 + sam_y^2 <= 4*r^2 & (sam_x^2) + (sam_y+r)^2 >= r^2 & (sam_x+r)^2 + (sam_y)^2 >= r^2 &  sam_x^2 + (sam_y-r)^2 >= r^2 & (sam_x-r)^2 + (sam_y)^2 >= r^2 ){
  sam_vec_out_x <- c(sam_vec_out_x,sam_x)
  sam_vec_out_y <- c(sam_vec_out_y,sam_y)
  counter <- counter +1
}
   
}
area <- pi*(2*r)^(2)
orange_area <-area* length(sam_vec_out_x)/50000
return(orange_area)
}


# in order to  calculate the standart diviation
# i have generated 50 estimators of the orange-area
# and calculate their variance

# computing the the area 50 times to compute the variance
monte_vector <- c(rep(0,50))
for (i in 1:length(monte_vector)) {
  monte_vector[i]<- monte_carlo(50000,r)
}

# standart diviation calculation
monte_mean <- mean(monte_vector)
variance <- sum((monte_vector - monte_mean)^2)/50
standart_div <- sqrt(variance)
round(standart_div,3)
## [1] 1.445
#confidence interval to the "orange area"
upper <- orange_area + qnorm(0.975)*standart_div
lower <- orange_area - qnorm(0.975)*standart_div
c(round(lower,3),round(upper,3))
## [1] 142.046 147.709

Solutions:

Q2. Analysis and Visualization of the COVID-19 Data (45 pt)

We would like to compare and visualize the trends in terms of numbers of COVID-19 cases and deaths between different countries.

  1. (6pt) Read the COVID-19 dataset file WHO-COVID-19-global-data.csv from the World’s Health Organization webpage (see the link Download Map Data on the bottom right corner of the map).

The data represents the daily number of cases and deaths from COVID19 in different world countries.

Change the column representing the date to Date. Make sure that this column represents only the date and set it to ‘date’ type. For example, the first element in the ‘Date’ column should be “2020-02-24”.

Show the head and tail of the resulting data-frame.

covid_data <- read.csv(url("https://covid19.who.int/WHO-COVID-19-global-data.csv"),comment.char = "#")
covid_data$Date_reported <- as.Date(covid_data$Date_reported)

head(covid_data,10)
##    Date_reported Country_code     Country WHO_region New_cases Cumulative_cases
## 1     2020-02-24           AF Afghanistan       EMRO         1                1
## 2     2020-02-25           AF Afghanistan       EMRO         0                1
## 3     2020-02-26           AF Afghanistan       EMRO         0                1
## 4     2020-02-27           AF Afghanistan       EMRO         0                1
## 5     2020-02-28           AF Afghanistan       EMRO         0                1
## 6     2020-02-29           AF Afghanistan       EMRO         0                1
## 7     2020-03-01           AF Afghanistan       EMRO         0                1
## 8     2020-03-02           AF Afghanistan       EMRO         0                1
## 9     2020-03-03           AF Afghanistan       EMRO         0                1
## 10    2020-03-04           AF Afghanistan       EMRO         0                1
##    New_deaths Cumulative_deaths
## 1           0                 0
## 2           0                 0
## 3           0                 0
## 4           0                 0
## 5           0                 0
## 6           0                 0
## 7           0                 0
## 8           0                 0
## 9           0                 0
## 10          0                 0
tail(covid_data,10)
##       Date_reported Country_code  Country WHO_region New_cases Cumulative_cases
## 28627    2020-07-07           ZW Zimbabwe       AFRO        18              734
## 28628    2020-07-08           ZW Zimbabwe       AFRO        53              787
## 28629    2020-07-09           ZW Zimbabwe       AFRO        98              885
## 28630    2020-07-10           ZW Zimbabwe       AFRO        41              926
## 28631    2020-07-11           ZW Zimbabwe       AFRO        16              942
## 28632    2020-07-12           ZW Zimbabwe       AFRO        40              982
## 28633    2020-07-13           ZW Zimbabwe       AFRO         3              985
## 28634    2020-07-14           ZW Zimbabwe       AFRO        49             1034
## 28635    2020-07-15           ZW Zimbabwe       AFRO        30             1064
## 28636    2020-07-16           ZW Zimbabwe       AFRO        25             1089
##       New_deaths Cumulative_deaths
## 28627          1                 9
## 28628          0                 9
## 28629          0                 9
## 28630          3                12
## 28631          1                13
## 28632          5                18
## 28633          0                18
## 28634          1                19
## 28635          1                20
## 28636          0                20
  1. (7pt) In this sub-question and the next one, we’re interested in plotting the COVID-19 trends in Israel and its neighbors.

Extract as candidate neighbours all countries with WHO_region = EMRO. Add Israel and other neighbour countries that you notice are missing, and remove far away countries (e.g. Afghanistan, Djibouti). Use your best judgment in selecting which additional countries to remove, and keep the total number of neighbour countries at below \(15\).

Replace long country names by meaningful short names for better readability and graph appearance.

For example, if Venezuela (Bolivarian Republic of) was one of our neighbours, we would have replaced it by Venezuela.

Next, plot the cumulative number of cases as a function of the Date for these countries (one plot, a different curve per each country).

Repeat and show in a different plot the cumulative number of deaths for each of these countries.

# Extract as candidate neighbours all countries with `WHO_region = EMRO`
candidate_neighbours <- covid_data %>% filter(WHO_region == "EMRO")

# Adding `Israel` and other neighbour countries that are missing
additions <- covid_data %>% filter(Country %in% c("Israel","Turkey","Cyprus"))
total_corona <- rbind(candidate_neighbours,additions)

# remove far away countries

total_corona <- total_corona %>% filter(!Country %in% c("Afghanistan","Bahrain","Djibouti","Libya","Morocco","Pakistan","Somalia","Sudan","Tunisia","Iran (Islamic Republic of)"))

# Replace long country names by meaningful short names for better readability and graph appearance.
levels(total_corona$Country)[levels(total_corona$Country) %in% c("occupied Palestinian territory, including east Jerusalem","Syrian Arab Republic")]<- c("Palestine","Syria")

levels(covid_data$Country)[levels(covid_data$Country) %in% c("occupied Palestinian territory, including east Jerusalem","Syrian Arab Republic")]<- c("Palestine","Syria")


# plot the `cumulative` number of `cases` as a function of the `Date` for these countries (one plot, a different curve per each country)

com_case<-total_corona %>% group_by(Date_reported) %>% ggplot(aes(y=Cumulative_cases,x=Date_reported,group=Country,color=Country,text=Country)) + geom_line() + labs(title = "Cumulative cases as function of date",y="Cumulative cases",x="Date")

ggplotly(com_case,tooltip = c("x","y","text"))
#Repeat and show in a different plot the `cumulative` number of `deaths` for each of these countries. 

com_death <- total_corona %>% group_by(Date_reported) %>% ggplot(aes(y=Cumulative_deaths,x=Date_reported,group=Country,color=Country,text=Country)) + geom_line() + labs(title = "Cumulative deaths as function of date",y="Cumulative deaths",x="Date")

ggplotly(com_death,tooltip = c("x","y","text"))
  1. (10pt) Load the economic dataset for world countries which we used in lab1 from here.

Merge the two data-frames such that the new data-frame will keep the information in the COVID-19 data-frame, yet will also contain for each row the total population of each country in \(2018\).

Manually rename country names that do not match between the two datasets - you don’t have to change all names, but focus on including countries that come up in the analysis of (b.) and of the next sub-questions.

Create four new columns, respectively representing the number of cumulative cases and deaths per one million people, and the number of new daily cases and deaths per one million people.

For the same countries used in (b.), plot in two separate figures

the log-scaled cumulative number of cases and deaths per million, as a function of the Date.

Which countries seem to be doing the worst based on these plots? how is Israel coping compared to its neighbours?

#Load the economic dataset for world countries
economic_data <- read.csv(url("https://raw.githubusercontent.com/DataScienceHU/DataAnalysisR_2020/master/data/economic_data.csv"),comment.char = "#")


#Merge the two data-frames such that the new data-frame will keep the information in the COVID-19 data-frame, yet will also contain for each row the total population of each country in 2018

economic_data <- economic_data %>% filter(Series.Name == "Population, total")
colnames(economic_data)[1] <- "Country"


#Manually rename country names that do not match between the two datasets 
# I did not include small island countries because there are meaningless
# to the analysis

levels(economic_data$Country)[levels(economic_data$Country) %in% c("Yemen, Rep.","Egypt, Arab Rep.","West Bank and Gaza","Syrian Arab Republic","Bahamas, The","Congo, Rep.","Czech Republic","Gambia, The","Iran, Islamic Rep.","Kyrgyz Republic","Slovak Republic")] <- c("Bahamas","Congo","Czechia","Egypt","Gambia","Iran","Kyrgyzstan","Slovakia","Syria","Palestine","Yemen")

levels(covid_data$Country)[levels(covid_data$Country) %in% c("Bolivia (Plurinational State of)","Democratic Republic of the Congo","Cֳ´te dג€™Ivoire","Iran (Islamic Republic of)","Kosovo[1]","Lao People's Democratic Republic","Republic of Korea","Republic of Moldova","The United Kingdom","United Republic of Tanzania","United States of America","Venezuela (Bolivarian Republic of)","Viet Nam")] <-c("Bolivia","Cote d'Ivoire","Congo, Dem. Rep.","Iran","Kosovo","Lao PDR","Korea, Rep.","Moldova","United Kingdom","Tanzania","United States","Venezuela, RB","Vietnam")

eco_corona <- left_join(covid_data,economic_data,by = "Country")
## Warning: Column `Country` joining factors with different levels, coercing to
## character vector
colnames(eco_corona)[12]<- "pop_2018"

#Create four new columns, respectively representing the number of *cumulative* `cases` and `deaths` per one million people, and the number of *new* daily `cases` and `deaths` per one million people

eco_corona <- eco_corona %>% mutate(cumulative_cases_permil = (as.numeric(as.character(Cumulative_cases))/as.numeric(as.character(pop_2018)))*1000000)
## Warning: NAs introduced by coercion
eco_corona <- eco_corona %>% mutate(cumulative_deaths_permil = (as.numeric(as.character(Cumulative_deaths))/as.numeric(as.character(pop_2018)))*1000000)
## Warning: NAs introduced by coercion
eco_corona <- eco_corona %>% mutate(new_cases_permil = (as.numeric(as.character(New_cases))/as.numeric(as.character(pop_2018)))*1000000)
## Warning: NAs introduced by coercion
eco_corona <- eco_corona %>% mutate(new_deaths_permil = (as.numeric(as.character(New_deaths))/as.numeric(as.character(pop_2018)))*1000000)
## Warning: NAs introduced by coercion
#For the same countries used in (b.), plot in two separate figures
#the *log-scaled* `cumulative` number of `cases` and `deaths` per million, as a function of the Date.

total_eco_corona <- eco_corona %>% filter(Country %in% c("Israel","Turkey","Cyprus","Palestine","Syria","Egypt","Iraq","Jordan","Kuwait","Lebanon","Oman","Qatar","Saudi Arabia","United Arab Emirates","Yemen"))

total_eco_corona$cumulative_cases_permil <-round(log(total_eco_corona$cumulative_cases_permil + 1),3)

com_case_mil <-total_eco_corona %>% group_by(Date_reported) %>% ggplot(aes(y=cumulative_cases_permil,x=Date_reported,group=Country,color=Country,text=Country)) + geom_line() + labs(title = "Cumulative number of cases per million as function of date(log scale)",x="Date",y = "Cumulative number of cases per million log scale" )

ggplotly(com_case_mil,tooltip = c("x","y","text"))
total_eco_corona$cumulative_deaths_permil <- round(log(total_eco_corona$cumulative_deaths_permil + 1),3)
com_deaths_mil <- total_eco_corona %>% group_by(Date_reported) %>% ggplot(aes(y=cumulative_deaths_permil,x=Date_reported,group=Country,color=Country,text=Country)) + geom_line() + labs(title = "Cumulative number of deaths per million as function of date(log scale)",x="Date",y = "Cumulative number of deaths per million log scale" )

ggplotly(com_deaths_mil,tooltip = c("x","y","text"))


Answer

The countries that seems to be doing the worst based on these plots
in terms of cases per million are:
Qatar,Kuwait,Oman,Saudi Arabia and the United Arab Emirates(by that order).
The countries that seems to be doing the worst based on these plots
in terms of deaths per million are:
Kuwait, Iraq, Saudi Arabia,Turkey and Oman(by that order).

Israel seems to be the 6th country in terms of highest number of cases per million
and 7th country in terms of highest number of deaths per million.
all in all very high compared to it’s neighbors.

  1. (7pt) One measure of the healthcare system strength in a country is the ratio of the number of deaths to the number of cases (with the caveats that this number is affected by other things like

the population age-structure, the fact that testing and diagnosing cases are different between countries, and that the pandemic is still ongoing and more deaths are expected in the future for current cases).

Calculate for each country the total cases per million and total deaths per million (It is recommended to create a new data-frame with one row per country), and make a scatter-plot comparing the two shown on a log-scale.

Fit a linear regression line to the log-scaled data.

Define the fatality rate as the ratio of the total deaths to the total cases. Display the distribution of fatality rate across countries using a plotting method of your choice. Describe the distribution: what is the mean/median? is it symmetric? skewed? are there outlier countries? which ones?

#Calculate for each country the `total cases per million` and `total deaths per million`  (It is recommended to create a new data-frame with one row per country), and make a scatter-plot comparing the two shown on a `log-scale`.

health <- eco_corona %>% group_by(Country) %>% summarise(total_cases_per_million = sum(new_cases_permil),total_deaths_per_million = sum(new_deaths_permil))
health <- na.omit(health)
health$total_cases_per_million <- round(log(health$total_cases_per_million + 1),3)
health$total_deaths_per_million <- round(log(health$total_deaths_per_million + 1),3)


health_plot <-  ggplot(health,aes(y=total_cases_per_million,x=total_deaths_per_million)) + geom_point(aes(color =Country))+geom_smooth(method = "lm",se =F) + labs(title = "total cases and deaths per million by country log scale",x = "log total deaths",y = "log total cases")

fig <-ggplotly(health_plot)
## `geom_smooth()` using formula 'y ~ x'
fig 
fatality_rate <- eco_corona %>% group_by(Country) %>% summarise(total_cases = sum(New_cases),total_deaths = sum(New_deaths)) %>% mutate(fat_rate = (total_deaths)/(total_cases))
fatality_rate_distri <- fatality_rate %>% ggplot(aes(fat_rate)) + geom_density() + labs(title = "Fatality rate distribution",x="Fatality rate")
fatality_rate_distri

paste("the mean of the distribution is ",round(mean(fatality_rate$fat_rate),3))
## [1] "the mean of the distribution is  0.032"
paste("the median of the distribution is ",round(median(fatality_rate$fat_rate),3))
## [1] "the median of the distribution is  0.021"
paste("the skewness of the distribution is ",round(skewness(fatality_rate$fat_rate),3))
## [1] "the skewness of the distribution is  2.826"
quan <- quantile(fatality_rate$fat_rate,prob=seq(0.9,1,0.01))
outliers <- fatality_rate %>% filter(fat_rate >= quan[10])
outliers
## # A tibble: 3 x 4
##   Country      total_cases total_deaths fat_rate
##   <chr>              <int>        <int>    <dbl>
## 1 France            163157        30015    0.184
## 2 Sint Maarten          79           15    0.190
## 3 Yemen               1530          434    0.284


Answer

It seems that the fatality rate distribution is not symmetrical hence it has high
skewness. Also, i for the outliers i have calculated the 99% of the distribution
in order to find countries with extreme values of fatality rate.
the outliers countries that i found are France,Sint Maarten and Yemen.

  1. (7pt) Find the countries worst hit by COVID-19, i.e. those with \(>10000\) total deaths.

For these countries, plot the smoothed number of new daily cases.

You can use the geom_smooth function.

Identify at least three different qualitative behaviors of the curves of the different countries. Which countries seemed to have overcome the pandemic?

in which countries it is still rising? are their countries with a second wave?

Do you see different patterns between different countries?

worst_hit <- covid_data %>% group_by(Country) %>% summarise(total_death = sum(New_deaths))
worst_hit <- worst_hit %>% filter(total_death>10000)
worst_hit <- as.vector(worst_hit$Country)

smood_data <- covid_data %>% filter(Country %in% worst_hit) %>% select(Date_reported,New_cases,Country)
smood_data$Date_reported <- as.Date(smood_data$Date_reported)
ww<-smood_data %>% ggplot(aes(y=New_cases,x=Date_reported,group = Country,color = Country,text = Country)) + geom_smooth(se =F) + labs(title = "smoothed new daily cases",x="Date",y="Cases") + facet_wrap(~smood_data$Country)+ theme(axis.title.x =element_text(color="black",size=15,face="bold"),axis.text.x =element_text(color="black",size=9,face="bold",angle = 90, vjust = 0.5,hjust = 1),axis.text.y =element_text(color="black",size=10,face="bold"),axis.title.y = element_text(color="black",size=15,face="bold"),title = element_text(color="black",size=15,face="bold")) 
ww
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


Answer

It seems that the daily cases are still rising in Brazil,United States,Mexico,India,Iran and a small rise in Spain.

Also, it is possible to see that no country is experiencing a second wave hence
there is no country that had a complete drop(to zero) in daily cases that followed
by an increase in daily cases

Moreover, it seems that europium countries have a slimier patterns but the non european countries(USA,India,Brazil,Peru and Iran) have unique
patterns

  1. (8pt) Our goal is to separate automatically the countries into groups: countries that have passed the pandemic first wave peak and are currently almost disease-free,

countries that are at the peak of the disease, and countries that have passed the peak but still see many cases, or are facing a ‘second wave’.

Create a new column for the merged data-frame containing a smoothed version of the daily number of new cases per million, where smoothing is performed using

moving-average with a moving-average “window” of width two-weeks (14 days). You can use for example the runmean command from the caTools package.

For each country compute the maximum number of smoothed new daily cases, representing the average number of new daily cases at the peak of the epidemic.

Similarly, compute for each country the last number of smoothed new daily cases, representing the current average number of daily cases. Make sure that the last number is indeed an average of the last \(14\) days and not fewer days (e.g. a single day).

i.e. it seems that the country still hasn’t reached its peak.

i.e. it seems that the country has passed its peak, but the epidemic is still active.

Determine the countries belonging to each of the three groups: recovered, exponentially increasing and struggling

(many countries do not belong to any of these three groups).

Next, for each of the three groups, make a figure showing all countries in this group in separate plots.

For each country, plot the smoothed number of new daily cases, divided by its max, as a function of the Date. Do the curves representing the different countries look similar within a group? do they look similar for countries from different groups?

For which group can you spot an indication for the start of a second wave of the pandemic? for which countries?

#Create a new column for the merged data-frame containing a smoothed version of the daily number of new cases per million, where smoothing is performed using 
#*moving-average* with a *moving-average*  "window" of width two-weeks (14 days)

eco_corona <- eco_corona %>% group_by(Country) %>% mutate(move_avg = runmean(new_cases_permil,14,endrule = "constant"))

# compute the maximum and last number of smoothed new daily cases
last <- eco_corona %>% filter(Date_reported == max(Date_reported)) %>% select(Country,move_avg)
colnames(last)[2] <- "last"
maximum <- eco_corona %>% group_by(Country) %>% mutate(maximum = max(move_avg)) %>% select(Country,maximum)
max_last <- left_join(last,maximum %>% unique(),"Country")

#Determine the countries belonging to each of the three groups:  `recovered`, `exponentially increasing` and `struggling`
recovered <- max_last %>% filter((last/maximum)<0.01) %>% select(Country)
struggling <-max_last %>% filter((last/maximum)>0.25 & (last/maximum)<0.75 ) %>% select(Country)
exponentially_increasing <- max_last %>% filter((last/maximum)>0.99) %>% select(Country)

#for each of the three groups, make a figure showing all countries in this group in separate plots. 

eco_corona <- data.frame(eco_corona,maximum$maximum)
div_max <- (eco_corona$move_avg)/(eco_corona$maximum)
eco_corona <- data.frame(eco_corona,div_max)

recovered_plot <- eco_corona %>% filter(Country %in% recovered$Country)
recovered_plot %>%  group_by(Country) %>% ggplot(aes(y=div_max,x=Date_reported)) + geom_line() + facet_wrap(~recovered_plot$Country) + theme(axis.title.x =element_text(color="black",size=15,face="bold"),axis.text.x =element_text(color="black",size=9,face="bold",angle = 90, vjust = 0.5,hjust = 1),axis.text.y =element_text(color="black",size=10,face="bold"),axis.title.y = element_text(color="black",size=10,face="bold"),title = element_text(color="black",size=15,face="bold"),strip.text = element_text(size = 7)) + labs(title = "recovered countries",x="Date",y="new daily cases divided by max")  

struggling_plot <- eco_corona %>% filter(Country %in% struggling$Country)
struggling_plot %>%  group_by(Country) %>% ggplot(aes(y=div_max,x=Date_reported)) + geom_line() + facet_wrap(~struggling_plot$Country) + theme(axis.title.x =element_text(color="black",size=15,face="bold"),axis.text.x =element_text(color="black",size=9,face="bold",angle = 90, vjust = 0.5,hjust = 1),axis.text.y =element_text(color="black",size=5),axis.title.y = element_text(color="black",size=10,face="bold"),title = element_text(color="black",size=10,face="bold"),strip.text = element_text(size = 5)) + labs(title = "struggling countries",x="Date",y="new daily cases divided by max") 

exponentially_increasing_plot <- eco_corona %>% filter(Country %in% exponentially_increasing$Country)
exponentially_increasing_plot %>%  group_by(Country) %>% ggplot(aes(y=div_max,x=Date_reported)) + geom_line() + facet_wrap(~exponentially_increasing_plot$Country) + theme(axis.title.x =element_text(color="black",size=15,face="bold"),axis.text.x =element_text(color="black",size=9,face="bold",angle = 90, vjust = 0.5,hjust = 1),axis.text.y =element_text(color="black",size=5),axis.title.y = element_text(color="black",size=10,face="bold"),title = element_text(color="black",size=10,face="bold"),strip.text = element_text(size = 5)) + labs(title = "exponentially increasing countries",x="Date",y="new daily cases divided by max") 


Answer
In the recovered group, most countries show the same curve shape: a steep ascent followed by a steep descent.

In the struggling group, there are many different curve shapes, but a large number of countries fluctuate around a roughly constant level until the month of July.

In the exponentially increasing group, most countries show the same curve shape: a steady ascent over time.

The group in which I can spot an indication of the start of a second wave is the struggling group.
Many countries in this group show a second-wave pattern (a steep ascent followed by a steep descent, and then the beginning of another ascent).
It can be clearly seen in countries such as Japan, Australia and Fiji.

Solutions:

Q3. Linear Congruential Generator (35 pt)

A Linear Congruential Generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a modular linear equation. The method represents one of the oldest and best-known pseudo-random number generator algorithms. The theory behind them is relatively easy to understand, and they are easy to implement and fast, especially on computer hardware which can provide modular arithmetic by storage-bit truncation.

(source: Wikipedia)

The generator is defined by the recurrence relation:

\(X_{n + 1} = (a X_n + c) \: mod \: (m)\)

where \(X_1,X_2,...\) is the sequence of pseudo-random values in the range \([0, m)\). All values \(a,c,m,X_i\) are nonnegative integers and:

\(m\) is the “modulus”, \(m > 0\).

\(a\) is the “multiplier”, \(0 < a < m\).

\(c\) is the “increment”, \(0 \leq c < m\).

\(X_0\) is the “seed” or “start value”, \(0 \leq X_0 < m\).

To produce pseudo-random numbers in the range \([0,1]\) we can simply divide and output the numbers \(\frac{X_i}{m-1}\).

The following visual example shows generating sequences of random numbers using the LCG for different parameters and seeds.
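As a small concrete trace of the recurrence (a sketch with hypothetical toy parameters, not the ones used below), starting from \(X_0 = 0\) with \(m=8, a=5, c=1\), each state is obtained from the previous one by the modular linear update, and dividing by \(m-1\) maps the states into \([0,1]\):

```r
# Toy trace of the LCG recurrence X_{n+1} = (a*X_n + c) mod m
# with hypothetical small parameters m = 8, a = 5, c = 1, X_0 = 0.
m <- 8; a <- 5; c <- 1
x <- 0
states <- integer(6)
for (i in 1:6) {
  x <- (a * x + c) %% m   # one step of the recurrence
  states[i] <- x
}
states            # the first six states: 1 6 7 4 5 2
states / (m - 1)  # mapped into [0,1] by dividing by m - 1
```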

  a. (6pt)

Write your own LCG function that implements an LCG-based pseudo-random number generator.

The function should accept the parameters \(m\), \(a\), \(c\), the current state \(X_0\) of the LCG and the number of random numbers \(n\) to generate.

The function should advance the state \(n\) times and return a vector of \(n\) pseudo-random numbers in the range \([0,1]\) based on the states, and also return the final state of the LCG \(X_n\) (i.e. the new seed).

my_LCG <- function(m, a, c, x_zero, n){
  rand_vec <- rep(0, n)
  for (i in 1:n) {
    x_new <- (a * x_zero + c) %% m      # advance the LCG state
    rand_vec[i] <- x_new / (m - 1)      # map the state into [0,1]
    x_zero <- x_new
  }
  # the appended string reports the final state X_n (the new seed); note that this
  # coerces the whole vector to character, so callers must convert back with as.numeric()
  return(c(rand_vec, paste("new seed is =", rand_vec[n] * (m - 1))))
}
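As a quick sanity check of the update rule used inside my_LCG (a sketch, re-implementing the same recurrence with hypothetical toy parameters chosen to satisfy the Hull–Dobell full-period conditions): such a generator should visit every residue \(0,\ldots,m-1\) exactly once per period.

```r
# Sanity check: with parameters satisfying the Hull-Dobell conditions
# (c coprime to m, a-1 divisible by every prime factor of m, and by 4 when 4 | m),
# the LCG has full period m, i.e. it visits each state 0..m-1 exactly once.
# Hypothetical toy parameters for this check: m = 16, a = 5, c = 1.
m <- 16; a <- 5; c <- 1
x <- 0
states <- integer(m)
for (i in 1:m) {
  x <- (a * x + c) %% m
  states[i] <- x
}
length(unique(states)) == m   # TRUE: full period
```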
  b. (5pt) A particular case of the LCG was implemented by IBM and was very popular

since the 1960’s. In this IBM.LCG the modulus is \(m=2^{31}\), the multiplier

is \(a=2^{16}+3\), and the increment is \(c=0\).

Set the seed to your ID number (‘teudat zehut’), generate \(2000\) consecutive pseudo-random numbers from the IBM.LCG and

divide them to \(1000\) consecutive pairs denoted \((x_i,y_i), i=1,...,1000\).

Create a scatter plot of the pairs. Does the spread of the points in the \([0,1]^2\) square seem to match i.i.d. uniform samples from this square?

IBM.LGC <- my_LCG(m = 2^31, a = 2^16 + 3, c = 0, x_zero = 206094278, n = 2000)
IBM.LGC <- as.numeric(IBM.LGC[1:2000])   # drop the appended "new seed" string and convert back to numeric
IBM.LGC_xi <- IBM.LGC[seq(1, 1999, 2)]   # odd positions  -> x_i
IBM.LGC_yi <- IBM.LGC[seq(2, 2000, 2)]   # even positions -> y_i
ps_data <- data.frame(x = round(IBM.LGC_xi, 3), y = round(IBM.LGC_yi, 3))
plot(ps_data$x, ps_data$y, pch = 19, cex = 0.5, main = "Uniform sample from IBM.LCG", xlab = "x", ylab = "y")

# create an i.i.d. uniform sample on the unit square using runif, for comparison
x <- runif(1000, 0, 1)
y <- runif(1000, 0, 1)
plot(x, y, pch = 19, cex = 0.5, main = "Uniform sample from runif")


Answer
Comparing the two plots (one from the IBM.LCG and one from runif), they look very much alike.
So in two dimensions the spread of the points in the [0,1]^2 square does seem to match i.i.d. uniform samples from this square.

  c. (6pt) Divide the unit square into \(B=10^2\) square bins of equal size. For each bin compute the expected number \(e_{ij}\) of points

in a sample of \(n\) points, vs. the observed number \(o_{ij}\) in your simulated data from (b.). Compute the goodness-of-fit test statistic for the data:

$$
S = \sum_{i,j=1}^{10} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}.
$$

For the null hypothesis \(H_0: (x_i, y_i) \sim [0,1]^2 \: i.i.d.\) we know that (approximately) \(S \sim \chi^2(B-1)\). Compute a p-value for this hypothesis testing problem and the statistic \(S\) you have computed. Would you reject the null hypothesis of uniform distribution?

B <- 10
expected <- rep(1000 / B^2, B^2)   # 10 points expected per bin under H0
observed <- matrix(0, B, B)
for (i in 1:B) {
  for (j in 1:B) {
    observed[i, j] <- sum(((i - 1) / B <= ps_data$x) & (ps_data$x < i / B) &
                          ((j - 1) / B <= ps_data$y) & (ps_data$y < j / B))
  }
}
statistic <- sum((observed - expected)^2 / expected)
statistic
## [1] 84.2
print(paste0("p value is ",round(1 - pchisq(statistic,B^2-1),3)))
## [1] "p value is 0.856"


Answer
The p-value is \(0.856 > 0.05\), therefore we fail to reject the null hypothesis \(H_0: (x_i, y_i) \sim [0,1]^2 \: i.i.d.\) of uniform distribution.

  d. (6pt) Draw \(n_{iters}=10000\) times a sample of \(n=1000\) i.i.d. points from the uniform distribution over the \([0,1]^2\) square (using runif) and compute the statistic \(S\) from above for each sample, generating random statistics \(S_1,..,S_{n_{iters}}\).

Plot the empirical density distribution of the \(S_i\) test statistics and compare it to the theoretical density of the \(\chi^2(B-1)\) distribution. Are the distributions similar?

Compute the empirical p-value of the test statistic \(S\) from (c.), defined as \(1-\hat{F}_{n_{iters}}(S)\)

where \(\hat{F}_{n_{iters}}\) is the empirical CDF of \(S_1,..,S_{n_{iters}}\). Does it match the theoretical p-value from (c.)?

#Plot the empirical density distribution of the $S_i$ test statistics and compare it to the theoretical density of the $\chi^2(B-1)$ distribution

# compute the chi-square goodness-of-fit statistic S for a fresh uniform sample of size n
statistic_s <- function(n){
  x <- runif(n, 0, 1)
  y <- runif(n, 0, 1)
  data <- cbind(x, y)
  B <- 10
  expected <- rep(n / B^2, B^2)
  observed <- matrix(0, B, B)
  for (i in 1:B) {
    for (j in 1:B) {
      observed[i, j] <- sum(((i - 1) / B <= data[, 1]) & (data[, 1] < i / B) &
                            ((j - 1) / B <= data[, 2]) & (data[, 2] < j / B))
    }
  }
  sum((observed - expected)^2 / expected)
}

dist_vec <- rep(0, 10000)
for (i in 1:length(dist_vec)) {
  dist_vec[i] <- statistic_s(1000)
}

my_chisq <-data.frame(dist_vec)
ggplot(data.frame(R_chisq = c(50, 160)), aes(x = R_chisq,color = "R chisq")) + stat_function(fun = dchisq, args = list(df = 99)) + geom_density(data = my_chisq,aes(x=dist_vec,color="my chisq")) +labs(title = "empirical chi-square vs theoretical chi square",x="x") 

#Compute the empirical p-value of the test statistic $S$ from (c.), defined as
#$1-\hat{F}_{n_{iters}}(S)$ 

p_val <- sum(dist_vec>=statistic)
p_val <- p_val/10000
round(p_val,3)
## [1] 0.85


Answer

The two distributions are very similar, as can be seen directly from the plot.

The empirical p-value (\(0.85\)) is very close to the theoretical p-value from (c.) (\(0.856\)), but not an exact match: the \(\chi^2(B-1)\) density is only an approximation to the true distribution of \(S\), and the empirical CDF is itself based on a finite number of simulations.

  e. (6pt) Repeat (b.) but this time use the LCG to generate consecutive triplets denoted \((x_i,y_i,z_i), i=1,...,1000\).

Create a 3-dimensional scatter plot of the simulated data using the package plotly and the command plot_ly.

Does the spread of the points in the \([0,1]^3\) cube seem to match i.i.d. uniform samples from this cube?

hint: you can rotate the 3-dim. plot in different directions. Look at the plot from the z-axis looking down on the x-axis and y-axis. What is your conclusion?

Note: the 3-dim. plot generated by plotly may not be observable when viewing the knitted html file from within \(R\). Use a web-browser like chrome, firefox etc. to view the html file and plot

IBM.LGC <- my_LCG(m = 2^31, a = 2^16 + 3, c = 0, x_zero = 206094278, n = 3000)
IBM.LGC <- as.numeric(IBM.LGC[1:3000])   # drop the appended "new seed" string
IBM.LGC_xi2 <- IBM.LGC[seq(1, 2998, 3)]
IBM.LGC_yi2 <- IBM.LGC[seq(2, 2999, 3)]
IBM.LGC_zi2 <- IBM.LGC[seq(3, 3000, 3)]
ps_data2 <- data.frame(x = round(IBM.LGC_xi2, 3), y = round(IBM.LGC_yi2, 3), z = round(IBM.LGC_zi2, 3))

plot_ly(ps_data2,x= ~x,y= ~y,z= ~z,color = ~z,size = I(30))


Answer
The spread of the points in the [0,1]^3 cube does not match i.i.d. uniform samples from this cube.
Looking at the cube from the z-axis down onto the x-y plane, we notice that the points are arranged in a clear pattern of parallel "blocks": the triplets generated by this IBM.LCG (known as RANDU) fall on a small number of parallel planes inside the cube.

The appearance of such a noticeable, deterministic pattern means that the distribution of points in the cube cannot be i.i.d. uniform.
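One way to see why the triplets cannot fill the cube: for these parameters, consecutive states satisfy an exact linear relation, which we can verify numerically (a sketch; the odd seed below is an arbitrary choice, not my ID).

```r
# RANDU parameters: since a^2 = (2^16 + 3)^2 = 2^32 + 6*2^16 + 9, and 2^32 = 0 (mod 2^31),
# we get a^2 = 6a - 9 (mod 2^31), hence X_{n+2} = 6*X_{n+1} - 9*X_n (mod 2^31).
m <- 2^31
a <- 2^16 + 3
x <- numeric(100)
x[1] <- 123456789                              # arbitrary odd seed
for (i in 2:100) x[i] <- (a * x[i - 1]) %% m   # products stay below 2^53, so %% is exact
relation <- (9 * x[1:98] - 6 * x[2:99] + x[3:100]) %% m
all(relation == 0)                             # TRUE: the exact linear relation holds
```

This deterministic relation between every three consecutive values is exactly what confines the triplets to parallel planes in the cube.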

  f. (6pt) Why do you think this random number generator fell out of favor? Correct it by changing the \(a\) and \(c\) parameters, and show that your correction indeed solves the problems encountered in the IBM.LCG.
IBM.LGC <- my_LCG(m = 2^31, a = 2^16 + 3^12, c = 54321, x_zero = 206094278, n = 3000)
IBM.LGC <- as.numeric(IBM.LGC[1:3000])   # drop the appended "new seed" string
IBM.LGC_xi3 <- IBM.LGC[seq(1, 2998, 3)]
IBM.LGC_yi3 <- IBM.LGC[seq(2, 2999, 3)]
IBM.LGC_zi3 <- IBM.LGC[seq(3, 3000, 3)]
ps_data3 <- data.frame(x = round(IBM.LGC_xi3, 3), y = round(IBM.LGC_yi3, 3), z = round(IBM.LGC_zi3, 3))

plot_ly(ps_data3,x= ~x,y= ~y,z= ~z,color = ~z,size = I(30))


Answer

The generator fell out of favor because of strong linear correlations between consecutive values of the LCG. With \(a = 2^{16}+3\) and \(c=0\) we have \(a^2 = 2^{32} + 6 \cdot 2^{16} + 9 \equiv 6a - 9 \: mod \: 2^{31}\), so every triplet satisfies \(X_{n+2} \equiv 6 X_{n+1} - 9 X_n \: mod \: 2^{31}\), which confines the points to a few parallel planes. Therefore I changed \(a\) and \(c\) in order to break this linear dependence between consecutive values.

As we can see in the new plot, the pattern of "blocks" is gone, which means that the change of parameters solves the problem encountered in the IBM.LCG.
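To further support the correction, we can check that the new parameters satisfy the Hull–Dobell conditions for a full-period LCG (a sketch; since the only prime factor of \(m = 2^{31}\) is 2, the conditions reduce to \(c\) being odd and \(a-1\) being divisible by 4). Note that full period is only a necessary sanity check; it does not by itself rule out bad lattice structure, which the 3-d plot above checks visually.

```r
# Hull-Dobell conditions for an LCG to have full period m:
#   (1) c and m are coprime
#   (2) a - 1 is divisible by every prime factor of m
#   (3) a - 1 is divisible by 4 if m is divisible by 4
# For m = 2^31 this reduces to: c is odd, and (a - 1) %% 4 == 0.
m <- 2^31
a <- 2^16 + 3^12
c <- 54321
c %% 2 == 1         # TRUE: c is odd, so gcd(c, m) = 1
(a - 1) %% 4 == 0   # TRUE: a - 1 = 596976 is divisible by 4
```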

Solutions: